System and method for transcoding digital content

ABSTRACT

A system and method for transcoding digital content (e.g. web content) by correctly employing one annotation for multiple digital contents. This can efficiently reduce the workloads required for the addition of annotation data during the transcoding process. A transcoding system comprises an annotation database system for storing annotation data to be used for the transcoding of contents, and a transcoder for transcoding the contents based on annotation data stored in the annotation database system. Upon receiving an inquiry from the transcoder, a correlation between elements in the contents and descriptions of the annotation data is checked to select one annotation that can be employed for transcoding the content. The correlation is specifically determined based on XPath information.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of and claims priority under 35 U.S.C. § to U.S. patent application Ser. No. 10/233,093 (“SYSTEM AND METHOD FOR TRANSCODING DIGITAL CONTENT”) filed Aug. 28, 2002, which claims priority under 35 U.S.C. § 119 to Japanese Patent Application No. 259846 filed Aug. 29, 2001, the entire contents of which are incorporated herein by reference.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a technique for transcoding information (e.g. digital content, such as a web page) on a network and for distributing the transcoded information, and in particular to a transcoding technique based on an annotation prepared for the information.

2. Description of the Related Art

When access to certain information on a network is requested by a predetermined terminal device, the desired information can be converted, in accordance with the specifications of the terminal device or its use environment, before being presented to the terminal device. This conversion technique is called “transcoding”. For example, to provide web content on the Internet, the structure of a web page can be adjusted by transcoding, thereby permitting the web content to be fitted into the small display screen of a portable information terminal, or the structure can be altered and adapted for use by a speech browser for voice synthesis.

Roughly speaking, there are two transcoding methods. One is a method for which no additional information is employed. The other is a method using external meta information (annotation). According to the transcoding method for which no additional information is employed, all web contents can be transcoded, regardless of the types and contents of the web data. However, because the types and contents of web data are not taken into account, the transcoding accuracy is low. On the other hand, according to the transcoding method based on annotation data, since appropriate transcoding is performed based on annotations that correspond to web contents, the transcoding accuracy is high. However, since much labor and high costs are required to input meta information for annotation, annotation information cannot be added to all web contents, and the number of web contents that can be transcoded is limited. Therefore, in order to transcode more web contents at high accuracy, what is important is how the workload for adding an annotation can be reduced.

FIG. 1 is a diagram for explaining the system configuration for performing transcoding based on annotations. In FIG. 1, a transcoding system comprises: a transcoder 910 for converting (transcoding) web content; and an annotation database system 920 in which annotation files used for transcoding are stored. In FIG. 1, when a terminal device 940 issues an access request to a web server 930, the web server 930 returns the target web content to be accessed, and the transcoder 910 receives the web content first. The transcoder 910 refers to the annotation database system 920, and transcodes the web content based on data, contained in an annotation file (and hereinafter referred to simply as an annotation), that corresponds to the web content. Thereafter, the obtained web content is transmitted by the transcoder 910 to the terminal device 940.

As a countermeasure for reducing the workload required by the thus arranged system to add an annotation for the transcoding process, it is important that an annotation authoring tool be prepared. Further, one annotation may also be employed for different web contents having the same layout. The conventional methods for correlating one annotation with multiple web contents can be sorted into three types.

1. The correlation between URLs (Uniform Resource Locators) and annotations is stored as table data (correlation table data).

2. A regular expression of URL is employed.

3. An annotation to be employed is dynamically determined by using a table structure of the web content (an automatic determination).

As is described above, when conversion using the transcoding technique is performed to provide information on a network, the transcoding method based on annotations is employed in order to attain high transcoding accuracy. However, since a heavy workload and high costs are required for the input of meta information for annotations, a typical network system, such as the Internet, cannot add an annotation to all the information, i.e., all the web contents, and the number of web contents that can be transcoded is limited. In order to reduce the workload required to add an annotation, the above described methods for correlating one annotation with multiple web contents have been proposed. However, for the method 1 whereby the correlation between URLs and annotations is stored as table data, it is not practical for the table content to be updated frequently in order to cope with new URLs that are generated day after day. Therefore, this method cannot be employed, especially for a web page used for describing news articles or search results obtained by a search engine.

For the method 2 using the regular expression of URL, the author of an annotation must analyze the URL structure of a web site and describe a complicated regular expression, so a great deal of work is required. Further, this method cannot cope with web contents whose layouts are dynamically changed using cookie data. If the method using a regular expression of URL is employed together with an XPath wildcard designating a specific portion of an HTML document, web content whose layout is changed dynamically can be coped with to some extent. In this case, overall, the URL structure of the web site is thoroughly analyzed, and a URL condition on which the same layout appears is determined. If the web content cannot be handled by the regular expression, the XPath wildcard is employed so that the method can be applied more widely.

FIGS. 2A and 2B are schematic diagrams showing example layouts for a web page on which news articles are described.

The layout in FIG. 2A differs from the layout in FIG. 2B in that a table “Top news” is inserted. The “Top news” table is arbitrarily added or deleted by a person acting as a web content manager. In this case, assume that a regular expression can be obtained for a URL that specifies the two web pages in common in FIGS. 2A and 2B, and that the XPath for the web pages is written as follows.

/html[1]/body[1]/table[7]/tbody[1]/tr[1]/td[3]/table[1]

If a wildcard is introduced in order to accommodate the addition or deletion of the “Top news” table, the XPath is written as follows.

/html[1]/body[1]/table[7]/tbody[1]/tr[1]/td[3]/table[starts-with(child::tbody[1]/tr[1]/td[1]/table[1]/tbody[1]/tr[1]/td[1], ‘▪Top news’)]

However, since these operations are so complicated and the description of the XPath also becomes complicated, a heavy workload is imposed on the author of the annotation. Furthermore, although the method for employing the XPath wildcard to change the layout can cope with a simple change, such as the addition or deletion of a visually semantic block (a header, a footer, a link list, main text and an advertisement; hereinafter referred to as a group) that is an element or component of the web content and is represented by a certain layout (e.g. a background color), it is difficult to handle a major change affecting the entire layout.

Further, even for specific web contents at the same URL, the layout may be dynamically changed based on other web contents that have been passed through before the specific web contents are reached. Similarly, the layout may be dynamically changed by re-loading the web content using the same URL. In these cases, using the regular expression of URL is not sufficient to add an annotation, and the XPath wildcard must be employed. However, when there is a major change in the layout, it is difficult for such a change to be handled by the XPath. In addition, there are many web pages on which the results obtained by a search engine are displayed. The layout of such pages tends to be changed greatly, depending on whether a search target (a page, a product, a book, etc.) corresponding to a matched keyword is present or not. In this case, it is also difficult to cope with the web pages by the traditional way of using the regular expression of URL and the XPath.

Furthermore, in the method 3 for correlating one annotation with multiple web contents by employing the table structure of web contents to dynamically determine which annotation is to be used, the table used for specifying a layout is employed as the criterion (reference) for determination. Thus, an appropriate annotation cannot be determined when a table in a web content is not used for a layout purpose, or when a layout having the same form but different content is employed. If the determination criteria are more strictly applied in order to avoid an erroneous determination (e.g. different layouts are regarded as being the same), layouts that are basically the same may be judged to be different, so an erroneous determination still cannot be avoided.

It is, therefore, one object of the present invention to correctly employ an annotation for multiple web contents and to thus efficiently reduce the workload required for adding an annotation during the transcoding process.

It is another object of the present invention to provide a tool for simplifying the addition of an annotation to web content.

SUMMARY OF THE INVENTION

According to the present invention, a system is provided wherein, during the transcoding process, an appropriate annotation is selected from among annotations stored in an annotation database, so that the annotation can be correctly employed for multiple web contents.

To achieve this object, according to the present invention, a system for transcoding digital content is provided. The system comprises: a database system for storing annotations to be used in a transcoding process; and a transcoder for transcoding the digital content based on an annotation stored in the database system. The database system selects the annotation based on correlation between elements in the digital content and descriptions of the annotations. The descriptions of the annotations may include descriptions for specifying certain portions of digital contents, which are typically written in “XPath”. If a plurality of annotations that can be applied to the digital content are found, the database system may select the annotation that includes the descriptions of as many elements in the digital contents as possible.

Furthermore, the present invention can be implemented by providing, for the web server, the function of the above described transcoding system. Specifically, a web server comprises: contents storage means for storing contents; annotation file storage means for storing annotations; transcoding means for employing correlation between the layout of elements of the contents and the descriptions of the annotations to obtain an annotation that can be employed for the contents to be processed, and for transcoding the contents; and transmission means for transmitting contents obtained by the transcoding.

According to the present invention, a method for transcoding digital content is also provided. This method comprises the steps of: obtaining the digital content; reading annotations to be used in a transcoding step from a database; determining an annotation corresponding to the digital content based on correlation between elements in the digital content and descriptions of the annotations; and transcoding the digital content based on the annotation determined at the determining step. The descriptions of the annotations may include descriptions for specifying certain portions of digital contents, which are typically written in “XPath”. The invention can also be implemented by a program product executable on a computer for performing the above-mentioned method of transcoding the digital content. This program product can be distributed by being stored on a recording medium, such as a magnetic disk, an optical disk or a semiconductor memory, or by being transmitted across a network.

According to the present invention, annotation data stored in an annotation database system has the following structure. Annotation data is stored in annotation files that are prepared for units of contents, and includes descriptions for the transcoding process. This data is correlated with a layout of elements in the digital contents, typically using XPath. The annotation files are roughly sorted based on location information of the contents on a network, which is typically a schematically described URL (Uniform Resource Locator). Further, the annotation data may include information for identifying an element (optional group) in the digital content for which a layout change is planned.

Furthermore, in order to simplify the process for adding an annotation to web contents, an annotation management apparatus or program product is provided that serves as a tool for the correlation with annotations. The apparatus or program product for managing annotation data to be used for transcoding digital content performs a method comprising the steps of: evaluating a correlation between elements in the digital content and a description in the annotation data; and presenting an interface to show a state of the correlation between the digital content and the annotation data, based on evaluation results obtained by the evaluating step. The interface may provide, for each element of the digital content, a list for displaying whether the description of corresponding annotation data is present. This interface also may provide a display component on which a detailed correlation between the description of the annotation data and the elements in the digital content is displayed. The interface further may provide a display component for accepting an entry from a user and interactively displaying a state of the annotation data corresponding to the digital content based on the entry. This interface may provide a display component for editing the annotation data.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 (PRIOR ART) is a diagram for explaining an example of a system configuration that implements transcoding based on an annotation.

FIGS. 2A and 2B are schematic diagrams showing example layouts for a web page that carries a news article.

FIG. 3 is a diagram for explaining an example of a system configuration according to one embodiment of the invention that implements transcoding based on an annotation.

FIGS. 4A and 4B are diagrams showing an example of XML descriptions for an XPath that corresponds to an optional group used for the embodiment.

FIG. 5 is a flowchart for explaining an example of the processing performed by a transcoder.

FIG. 6 is a flowchart for explaining an example of the processing for an annotation database according to the embodiment.

FIG. 7 is a diagram for explaining an example of the functional arrangement of a site pattern analyzer used for the embodiment.

FIG. 8 is a diagram showing an example operating screen for the annotation management using the site pattern analyzer.

FIG. 9 is a flowchart for explaining an example of the semi-automatic correction processing performed when it is determined that multiple annotations can be applied for the same page.

FIG. 10 is a flowchart for explaining an example of the processing for adding an annotation to all web contents in a predetermined web site by using the site pattern analyzer.

DETAILED DESCRIPTION OF A PREFERRED EMBODIMENT OF THE INVENTION

The preferred embodiment of the present invention will now be described in detail, while referring to the accompanying drawings.

In order to accurately employ an annotation for multiple digital contents, such as web contents, the present invention implements a system wherein an appropriate annotation is selected from among annotations stored in an annotation database system, during a transcoding process. Further, in order to simplify the process by which an annotation is added to web contents, a site pattern analyzer is provided as an annotation management tool. An explanation will be given separately for the system that selects and uses an appropriate annotation at the time of transcoding, and the site pattern analyzer.

FIG. 3 is a diagram for explaining an example of a system configuration according to the embodiment that performs transcoding based on an annotation. In FIG. 3, the transcoding system comprises: a transcoder 10 for converting (transcoding) digital contents, such as web contents; and an annotation database system 20 in which an annotation file is stored to be used for the transcoding process. The transcoder 10 is located between a web server 30 for providing original web contents and a terminal device (web client) 40 for requesting the web contents from the web server 30. In accordance with the specifications and the environment for the terminal device 40, the transcoder 10 transcodes the web contents downloaded from the web server 30, and transmits the contents thus obtained to the terminal device 40.

In this arrangement, the web server 30 is, for example, a server machine that is implemented by a computer system, such as a workstation or a personal computer. The terminal device 40 can be a computer system, such as a workstation or a personal computer, or an information terminal, such as a PDA (Personal Digital Assistant) or a mobile telephone, and is connected to the web server 30 by a network.

The transcoder 10 is a module provided on the network that connects the web server 30 and the terminal device 40, and its functions are carried out by the CPU in a computer system, such as a workstation or a personal computer, that is controlled by a program. The transcoder 10 may be provided as an independent computer system that provides a service for transcoding web contents received from the web server 30, or may be an added function of a computer system that functions as the web server 30.

The annotation database system 20 is implemented by data recording means, such as a hard disk or a semiconductor memory, and a management system that manages the data recording means. The management system can be, for example, the CPU of a computer system, such as a workstation or a personal computer, that is controlled by a program. The above configuration itself is substantially the same as that of the conventional transcoding system.

In this embodiment, upon receiving an inquiry from the transcoder 10, i.e., during the transcoding of web contents, an annotation that can be applied to the web contents to be transcoded is selected from among the annotation files stored in the annotation database system 20. As a result, the transcoder 10 transcodes the web contents based on the annotation selected from among the annotation files in the annotation database system 20.

The annotation file is so selected from the annotation database system 20 by determining whether the XPath (a description for specifying a certain portion or position of the content) in the pertinent annotation can be applied to the target web contents, i.e., whether the XPath correctly corresponds to a group or an element (a block having a visual meaning that is represented by a layout, such as a web contents background color, e.g., a header, a footer, a link list, a text, and an advertisement) in the web contents.
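By way of illustration only, the applicability check described above might be sketched as follows. The sketch assumes that the web contents have already been parsed into a well-formed element tree and that each XPath consists solely of positional steps of the form tag[n]; the function names resolve and annotation_applies are hypothetical and do not appear in the embodiment.

```python
import re
import xml.etree.ElementTree as ET

# One positional step such as "table[7]" (assumed form; wildcards are not handled).
STEP = re.compile(r"^([A-Za-z][\w-]*)\[(\d+)\]$")


def resolve(root, xpath):
    """Return the element addressed by a positional XPath, or None if absent."""
    steps = [s for s in xpath.split("/") if s]
    if not steps:
        return None
    first = STEP.match(steps[0])
    # The first step must name the root element itself, e.g. "html[1]".
    if first is None or first.group(1) != root.tag or first.group(2) != "1":
        return None
    node = root
    for step in steps[1:]:
        m = STEP.match(step)
        if m is None:
            return None
        tag, index = m.group(1), int(m.group(2))
        same_tag = [child for child in node if child.tag == tag]
        if index < 1 or index > len(same_tag):
            return None
        node = same_tag[index - 1]
    return node


def annotation_applies(root, xpaths):
    """An annotation is applicable only if every one of its XPaths resolves."""
    return all(resolve(root, x) is not None for x in xpaths)
```

Under these assumptions, an annotation whose XPaths all resolve against the tree of the target web contents is treated as applicable, and an annotation with any unresolved XPath is not.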

The annotation files in the annotation database system 20 are roughly sorted by the URLs of the web contents. That is, a schematic URL, e.g. the server name or the folder name of a site where the web contents to be processed are present, is correlated with each annotation file. To search for an annotation to be applied to the web contents to be processed, first, the URL of the web contents is employed as a search key, and all the annotations that are correlated with such a URL are regarded as candidate annotations for the web contents to be processed. From among these candidates, an appropriate annotation is then selected based on the relationship between the above described group (element) in the web content and the XPath.
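The rough sorting by schematic URL might be sketched, for illustration only, as a prefix lookup. The index layout (a mapping from a server name and folder prefix to annotation file names), the function name candidates_for, and the example host names are all assumptions made for this sketch.

```python
from urllib.parse import urlsplit


def candidates_for(url, index):
    """Collect every annotation file whose schematic URL prefixes the page URL.

    `index` maps a schematic URL (server name plus optional folder, without
    scheme) to a list of annotation file names; this layout is hypothetical.
    """
    parts = urlsplit(url)
    target = parts.netloc + parts.path
    found = []
    for schematic_url, annotation_files in index.items():
        if target.startswith(schematic_url):
            found.extend(annotation_files)
    return found


# Example index with made-up entries.
EXAMPLE_INDEX = {
    "news.example.com/articles/": ["article_a.xml", "article_b.xml"],
    "news.example.com/search": ["search.xml"],
}
```

Because the key only narrows the candidates, no regular expression over the full URL structure is needed at this stage; the final choice is left to the XPath comparison.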

The system in this embodiment supports an optional group that is not related to the determination process performed to decide whether the annotation can be applied for web contents or not. Specifically, in the web contents, there is a group such that, even when it is dynamically moved, added or deleted, a change in its layout does not greatly affect the web contents. This optional group is, for example, an advertisement object, the location of which is changed at random each time the object is reloaded, or a photographic object in the web contents carrying the news article. When these groups are regarded as optional groups, whether the annotation can be applied or not can be appropriately determined, regardless of the presence/absence or the size of these optional groups, or a change in their locations.

The optional group can be set by adding an “optional” attribute to the group using the XML description. For example, assume that, in predetermined web contents, the object for a banner advertisement is displayed on one of the following two XPaths:

/html[1]/body[1]/table[7]/tbody[1]/tr[1]/td[3]/table[1]/tbody[1]/tr[2]/td[1]

/html[1]/body[1]/table[8]/tbody[1]/tr[1]/td[2]

In these cases, when the optional attribute is added to the two groups, the advertisement object can be set as an optional object, and can be excluded from a group for which a determination is to be made as to whether to apply an annotation. FIG. 4A shows an example of the XML description for the XPath in the first case, and FIG. 4B shows an example of the XML description for the XPath in the second case.
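The effect of the optional attribute on the determination might be sketched as follows, purely as an illustration. The Group record, the representation of a page as the set of XPaths that resolve in it, and the function names applies and drop_absent_optional_groups are assumptions of this sketch and are not the XML format of FIGS. 4A and 4B.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Group:
    xpath: str
    optional: bool = False  # corresponds to the "optional" attribute


def applies(groups, page_xpaths):
    """Only non-optional groups take part in the applicability decision."""
    return all(g.xpath in page_xpaths for g in groups if not g.optional)


def drop_absent_optional_groups(groups, page_xpaths):
    """Keep required groups and those optional groups actually present."""
    return [g for g in groups if not g.optional or g.xpath in page_xpaths]
```

With this arrangement, a banner advertisement that appears at either of the two XPaths, or not at all, does not change whether the annotation is judged applicable.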

FIG. 5 is a flowchart for explaining an example operation of the transcoder 10. While referring to FIG. 3, the transcoder 10 accepts an HTTP request from the terminal device 40 (step 201), and downloads the requested web contents (target HTML) from the web server 30 (step 202). The transcoder 10 then converts the HTML of the obtained web contents into a DOM tree (step 203), and employs the DOM tree to issue an inquiry to the annotation database system 20 for an annotation that corresponds to the web contents (step 204).

When the transcoder 10 receives from the annotation database system 20 an annotation that corresponds to the web contents, the transcoder 10 first performs a required preprocess for the annotation (step 205). An example of the required preprocess is the exclusion of an optional group that does not correspond to the web contents. The transcoder 10 converts the DOM tree of the web contents based on the annotation for which the required preprocess has been completed, and initiates the transcoding operation (step 206). As a result, the objects can be rearranged for the web contents, or the web contents can be altered for synthesized voice output. Thereafter, the transcoder 10 converts the DOM tree into HTML (step 207), and transmits the transcoded web contents to the terminal device 40 that originally issued the HTTP request (step 208).
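The flow of steps 202 through 207 might be outlined, as a rough sketch only, in the following form. The callables fetch, lookup and transform are hypothetical stand-ins for the download from the web server 30, the inquiry to the annotation database system 20, and the annotation-driven conversion; the sketch further assumes well-formed markup so that a standard XML parser can build the tree.

```python
import xml.etree.ElementTree as ET


def handle_request(url, fetch, lookup, transform):
    """Outline of the transcoder flow; the caller handles steps 201 and 208."""
    markup = fetch(url)                    # step 202: download the target contents
    dom = ET.fromstring(markup)            # step 203: build a tree from the markup
    annotation = lookup(url, dom)          # step 204: inquire for an annotation
    if annotation is not None:
        # steps 205-206: preprocess the annotation and convert the tree
        dom = transform(dom, annotation)
    # step 207: serialize the tree back to markup
    return ET.tostring(dom, encoding="unicode")
```

When lookup returns no annotation, this sketch simply returns the contents unconverted, which is one of the fallback behaviors mentioned for the case in which no annotation is found.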

FIG. 6 is a flowchart for explaining an example operation of the annotation database system 20. The operation in the annotation database system 20 to which the transcoder 10 has forwarded an inquiry will now be described while referring to FIG. 6. When the inquiry from the transcoder 10 is accepted by the annotation database system 20 (step 301), the URL (Uniform Resource Locator) of the web contents is employed as a key to search for an annotation having a matching resource (step 302). When an annotation having a matching URL is not present, an error message is returned to the transcoder 10 (steps 303 and 304). In this case, either the transcoding for the web contents is not performed, or available transcoding for which an annotation is not employed is performed.

If only one annotation having a matching URL is found in the annotation database system 20, this annotation is transmitted to the transcoder 10 (steps 303, 305 and 311). If there are multiple annotations having the matching URL in the annotation database system 20, an annotation for which all the XPaths match the groups (elements) in the web contents is selected (steps 305 and 306). If only one such annotation is selected in the annotation database system 20, this annotation is transmitted to the transcoder 10 (steps 307 and 311).

If multiple annotations in which all the XPaths match are found in the annotation database system 20, the annotation whose number of matched groups (elements) is the greatest is selected and transmitted to the transcoder 10 (steps 307, 308, 309 and 311). If there are multiple such annotations in the annotation database system 20, the latest annotation is selected and transmitted to the transcoder 10 (steps 309, 310 and 311). Through this processing, an annotation corresponding to the inquiry issued by the transcoder 10 is selected and is used by the transcoder 10 for the transcoding.
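The selection procedure of steps 302 through 311 might be sketched as follows, for illustration only. The Annotation record, its updated field (a larger value meaning a later annotation), the prefix comparison on the URL, and the representation of a page as the set of XPaths that resolve in it are assumptions of this sketch. Returning None when no candidate has all of its XPaths matched is likewise an assumption, chosen to mirror the error notification.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Annotation:
    name: str
    url_prefix: str            # schematic URL used as the first search key
    xpaths: Tuple[str, ...]    # one XPath per described group (element)
    updated: int               # hypothetical timestamp; larger means later


def select_annotation(url, page_xpaths, database):
    """Return the annotation to use, or None to signal an error."""
    candidates = [a for a in database if url.startswith(a.url_prefix)]  # step 302
    if not candidates:
        return None                                   # steps 303-304: error
    if len(candidates) == 1:
        return candidates[0]                          # steps 305, 311
    matching = [a for a in candidates
                if all(x in page_xpaths for x in a.xpaths)]             # step 306
    if not matching:
        return None                                   # assumed: treated as an error
    if len(matching) == 1:
        return matching[0]                            # steps 307, 311
    most = max(len(a.xpaths) for a in matching)       # step 308: most matched groups
    best = [a for a in matching if len(a.xpaths) == most]
    return max(best, key=lambda a: a.updated)         # steps 309-311: latest wins
```

Since every XPath of a fully matching annotation corresponds to a group in the page, the number of matched groups is taken here to equal the number of XPaths in the annotation.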

As is shown in FIG. 6, if an annotation that matches web content is not present in the annotation database system 20, an error notification is issued to the transcoder 10. If a plurality of annotations match the same web content, the annotation having the greatest number of matched groups (elements) in the web content, or the latest annotation, is selected. In addition to the above described method, the conventional layout matching technique for determining the similarity between the layout of web contents and an assumed layout based on an annotation may be employed to determine the similarity between the layouts, and an annotation for which the assumed layout is the most similar may be employed first.

However, the above described states mean that there are non-matching annotations. This will occur not only when the addition of annotations for web contents is not satisfactory, but also when the layouts for the target web contents are changed after the system has been activated, or when web contents having new layouts are added. Therefore, it is preferable that, even after the system has been activated, the correlation between the web contents and the annotations be monitored and adjusted as needed.

As one method for implementing the invention, when a transcoding error has occurred because web contents do not match any annotation, because web contents match multiple annotations, or because web contents are transcoded while containing text information that is not designated by a group, notification of the occurrence of the transcoding error may be transmitted to the terminal device used for annotation authoring to inform the author that this state exists (on-the-fly tests). Upon receiving this notification, the author may employ a tool such as a site pattern analyzer, which will be described later, or may employ an annotation editor to edit an annotation, so that the state wherein appropriate annotations are correlated with web contents is maintained.

As is described above, since a check is performed to determine whether the elements that correspond to all the XPaths included in an annotation are present in web contents to be transcoded, the appropriateness of the annotation is determined by the annotation database system 20. Thus, the layout of the web contents can be determined in real time, and an appropriate annotation can be selected. Further, since the layout of the web contents is directly determined, unlike the conventional technique according to which a table structure for the web contents is referred to when determining an annotation to be used, the annotation can be controlled to prevent an erroneous estimate such that different layouts are regarded as being the same, or such that like layouts are regarded as being different.

In the method for managing the annotation file in the annotation database system 20, a URL is employed as the first key when searching for a desired annotation, and thereafter, determination of an appropriate annotation depends on the layout of the web contents. That is, the annotation candidates to be used are roughly determined by referring to URLs, and thereafter, the annotation to be used is specified in accordance with the correlation between the actual layout of the web contents to be transcoded and the description of the annotation (i.e., the XPaths of the group and the annotation). Therefore, so long as the location of web contents can be roughly designated, any URL can be used as a search key and the regular expression of a restricted URL need not be designated.

The operation of adding annotations itself differs in no way from the conventional operation performed to add an annotation to web contents. Further, if there is no annotation available for use with predetermined web contents, instead of using an XPath wildcard to cope with this situation, all that is necessary is for another annotation to be input that can be used with the web contents. Therefore, the annotation addition process is simplified. In addition, since an XPath wildcard need not be taken into account, the semi-automatic generation of an XPath can be easily performed using the annotation editor. As is described above, according to the embodiment, the addition and the adjustment of an annotation can be greatly simplified.

As is described above, according to the embodiment, the annotation is not generalized using an XPath wildcard, but instead, a necessary annotation is added to the layout of desired web contents. For example, when the “Top news” table in FIGS. 2A and 2B is added or deleted, or when the layout of the web contents is changed, even with the same URL, an annotation is generated for the individual layouts. Thus, the number of annotation files required to transcode the same number of web contents is increased compared with the conventional method according to which annotations are generalized. However, the adjustment of a regular expression of URL and an XPath, which is a complicated operation that is difficult to maintain, can be replaced by the simple operation of adding an available annotation to web contents for which there is no corresponding annotation. Thus, operating costs can be reduced, and maintenance can be simplified.

The transcoding system (the transcoder 10 and the annotation database system 20) in FIG. 3 for the embodiment is provided separately from the web server 30. However, in this embodiment, the function of the transcoding system may be provided for the web server 30. In this case, the web server 30 comprises: web content storage means for storing web contents; annotation file storage means that corresponds to the annotation database system 20; and web content transcoding means that corresponds to the transcoder 10. Upon receiving a request from a web client, the web contents obtained by the transcoding means are transmitted via transmission means, such as a network interface.

Even when the transcoding system of this embodiment, including the annotation database system 20, is employed, as is described above, two problems can still occur: it may be determined that multiple annotations can be applied to the same web contents, and web contents may be found that do not match any annotation. The first case is resolved, during the process performed to examine the annotation database 20, by selecting the annotation having the greatest number of matched groups, or by selecting the latest annotation. However, to the extent possible, it is preferable that a single annotation correspond to each set of web contents. In this embodiment, therefore, a tool, a site pattern analyzer, is provided that manages annotations and supports the detection and resolution of the above problems.
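The selection among multiple applicable annotations can be illustrated with a minimal sketch. The record type `Candidate`, its field names, and the function `select_annotation` are hypothetical and do not appear in the embodiment. The text offers the greatest number of matched groups and the latest annotation as alternative criteria; using the second as a tie-breaker for the first is an assumption made only for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """Hypothetical record for one annotation that matched the web contents."""
    name: str
    matched_groups: int  # number of annotation groups found in the contents
    modified: float      # last-modified time of the annotation file


def select_annotation(candidates: List[Candidate]) -> Optional[Candidate]:
    """Prefer the most matched groups; break ties with the newest file.

    Combining the two criteria this way is an assumption of this sketch.
    """
    if not candidates:
        return None  # no annotation matches; a new one has to be authored
    return max(candidates, key=lambda c: (c.matched_groups, c.modified))
```

Returning `None` for an empty candidate list corresponds to the second problem described above, in which web contents match no annotation.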

The site pattern analyzer is software (a program) for displaying the tree structure of web contents in a desired site, and for interactively presenting the state in which annotations have been added. For example, to perform the annotation management for this embodiment, the site pattern analyzer may be installed in a computer that implements the transcoder 10 and the annotation database system 20, where it controls the CPU of that computer so that the computer operates as the annotation management apparatus. When the site pattern analyzer is employed, the annotation author can confirm the annotation addition state while overviewing the desired site in its entirety, and can either add a new annotation or adjust a currently available annotation as needed. The site pattern analyzer, which is software, can be distributed by being stored on a storage medium, such as a magnetic disk, an optical disk or a semiconductor memory, or by being transmitted across a network.

Management software of this type has also been studied for the conventional transcoding system that employs the regular expression of a URL and the XPath. However, the following problems remain, and it is difficult to design practical management software. First, since the unit of an annotation is a group and each group has a different URL regular expression, it is difficult to present a list of the existing correlations between the annotations and target web contents. Second, since the intergroup relationships in the web contents are complicated by the XPaths and the regular expressions, total management of the groups is difficult. In the embodiment, to resolve the first problem, the annotations are managed by a unit of web contents (a set of groups/elements), so that the management of annotations can be visualized using a table. To solve the second problem simply, the annotations are managed by the unit of web contents, and for web contents to which no annotation is added, a new annotation is added.

FIG. 7 is a diagram for explaining an example of the functional arrangement of a site pattern analyzer. In FIG. 7, a site pattern analyzer 50 provided for this embodiment comprises: a matching evaluation module 51, a tree view controller 52, an annotation correction module 53, a matching character string extraction module 54, and a browser/DOM tree synchronization module 55. These components are software blocks the functions of which are performed by the CPU, which is controlled by the program in the computer in which the site pattern analyzer 50 is installed. Further, the site pattern analyzer 50 prepares an annotation table 56 and a matching table 57 in the main memory of the computer or in the cache memory of the CPU, and employs these tables for the processing.

At the time of activation or reloading, the thus arranged site pattern analyzer 50 traverses the entire site to be managed by annotations, and employs input means (not shown), such as an interface, to cache the information for the web contents. At this time, an HTML file list for the web contents is also created. The site that is to be managed can be arbitrarily designated by an author.

Further, at the same time, the site pattern analyzer 50 employs input means (not shown), such as an interface, to read all the data for the annotation files from the annotation database 20, and stores the data in the annotation table 56.

The matching evaluation module 51 receives the HTML file of the web contents, which are cached as a processing target during the initial operation, and the HTML file list, and also receives the data for the annotation file (hereinafter referred to as annotation data) from the annotation table 56. Then, the matching evaluation module 51 calculates the matching of the XPath in the annotation and the web contents, and stores the calculation results (the evaluation results) in the matching table 57. The evaluation results stored in the matching table 57 are displayed as a list for viewing on the operating screen of the site pattern analyzer 50, which will be described later. When recalculation is required, e.g., when the site pattern analyzer 50 is activated or reloaded, the matching evaluation module 51 is called and performs the required processing.
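The evaluation performed by the matching evaluation module 51 can be sketched as follows. This is an illustration only, not the module's actual implementation. It assumes that the cached web contents are well-formed XML (XHTML), that each annotation is reduced to a list of group XPaths of the positional form shown in this description, and that the standard-library ElementTree path subset is sufficient. The names `to_relative` and `evaluate_matching` and the dictionary layout of the resulting table are hypothetical.

```python
import xml.etree.ElementTree as ET


def to_relative(xpath):
    """Rewrite '/html[1]/body[1]/...' as a path relative to the root element.

    ElementTree does not accept absolute paths on an element, so the first
    step (assumed to name the root element) is dropped without being checked.
    """
    steps = [s for s in xpath.split("/") if s]
    return "./" + "/".join(steps[1:]) if len(steps) > 1 else "."


def evaluate_matching(pages, annotations):
    """Return {(url, annotation name): [True/False for each group XPath]}.

    pages:       {url: well-formed XHTML markup}      (assumed input shape)
    annotations: {annotation name: [group XPaths]}    (assumed input shape)
    """
    table = {}
    for url, markup in pages.items():
        root = ET.fromstring(markup)
        for name, xpaths in annotations.items():
            table[(url, name)] = [
                root.find(to_relative(xp)) is not None for xp in xpaths
            ]
    return table
```

Each entry of the returned table records, per group, whether a corresponding element is present, which is the kind of information the matching table 57 is described as holding. Predicates such as starts-with() are outside the ElementTree subset and would require a full XPath processor.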

The tree view controller 52 receives the HTML file for the web contents, which are cached as a processing target during the initial operation, and the HTML file list, and also receives the data for the annotation file from the annotation table 56. The tree view controller 52 then displays the tree for the annotations and the web contents for the entire site that is to be managed. The data output by the tree view controller 52 is displayed as a tree for viewing on the operating screen of the site pattern analyzer 50, which will be described later.

The annotation correction module 53 controls the changes made to the annotation data stored in the annotation table 56 by property changes or semi-automatic corrections. The annotation correction module 53 also controls the temporary caching of the annotation changes and the reflection of those changes in the actual annotations.

The matching character string extraction module 54 reads, from the matching table 57, the evaluation results obtained by the matching evaluation module 51, and receives the DOM tree of the web contents that are cached as a processing target during the initial operation. The matching character string extraction module 54 then calculates a matching character string, so that the matching details of the XPaths in the annotations and the web contents are displayed using a character string, or so that an empty group, for which there are no matched contents, or omitted contents are displayed. The processing results obtained by the matching character string extraction module 54 are displayed as a detailed view on the operating screen of the site pattern analyzer 50, which will be described later.

The browser/DOM tree synchronization module 55 synchronizes a DOM tree of predetermined web contents with the browser view of the predetermined web contents. The data output by the browser/DOM tree synchronization module 55 are displayed as a browser view on the operating screen of the site pattern analyzer 50, which will be described later.

FIG. 8 is a diagram showing an example operating screen of the site pattern analyzer 50 for the annotation management. As is shown in FIG. 8, a tree view 61, a list view 62, a detailed view 63 and a browser view 64 are provided for an operating screen 60.

The tree view 61 is the output of the tree view controller 52 and indicates the annotations and the tree structure of the web contents for the entire site that is to be managed. When the author selects a desired directory in the tree view 61, the author can designate the directory that includes the web contents for which the author desires to confirm the annotation application state.

The list view 62 indicates, for the web contents included in the directory designated in the tree view 61, a list of the evaluation results that are obtained by the matching evaluation module 51 and that are stored in the matching table 57.

The list view 62 includes page titles 62 a and URLs 62 b for specifying web contents, the ID 62 c (annotation name) for specifying the corresponding annotation, the number of annotations 62 d that match the web contents, and correlations 62 e (the presence or absence of corresponding elements) between the groups (elements) in the web contents and the descriptions in the annotations. With this list view 62, the author can determine for which web contents an annotation should be adjusted or a new one should be added, and can also identify a required operation. In addition, when the author selects desired web contents in the list view 62, the author can designate the web contents to be displayed in the detailed view 63.

In this embodiment, in the process for matching web contents and an annotation, even when part of the contents is not covered by the annotation, the annotation may be determined to be applicable to the web contents, and the remaining contents may be left undefined for transcoding (super set problem). To avoid this problem, an annotation omission indicator string and an empty group indicator string (62 f) can be provided for each annotation in the list view 62 in FIG. 8. When the total number of characters in the text nodes, or in the ALT attributes of images, that are not included in any group of an annotation exceeds a predetermined number (designated by a user), the annotation omission indicator string is displayed as an alert indicating that there may be an omitted annotation. The empty group indicator string is displayed when a group contains no contents. Since these indicator strings are displayed, the author of the annotation can easily determine which annotation should be examined or corrected.

For these indicator strings, for example, the contents can be converted into a character string according to the following rules, and the content volume can be measured.

1. Use character string of the normal text node.

2. Use character string of the ALT attribute in case of an image.

3. Determine as a character string an image file name for an image having no ALT attribute. It should be noted that when an image file name, such as spacer.gif or lxlwhite.gif, is listed in advance for an image that can not be regarded as contents, this image file name is excluded.

4. When the number of input characters is obtained in advance by text-text area input, employ its dummy string (e.g., xxxxxx) as a character string. When the input character count is not known, employ a character string having an arbitrary number of characters.

5. Treat “input” of “image type” in the same manner as is an image.

6. When an embedded object is present, allocate an appropriate character string based on the size of the object occupied on the screen.

7. When the contents, such as JavaScript, are to be dynamically generated or moved, calculate a character string that does not exceed a predictable range.
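Rules 1, 2, 3 and 5 above, together with the threshold test for the annotation omission indicator string, can be sketched as follows. This is a minimal illustration under stated assumptions. The names `ContentMeasurer`, `content_volume`, `needs_omission_alert` and `IGNORED_IMAGES` are hypothetical, the ignore list holds only one of the example file names, and rules 4, 6 and 7 (form input, embedded objects and dynamically generated contents) are not handled.

```python
from html.parser import HTMLParser
import posixpath

# Hypothetical list of image file names not regarded as contents (rule 3).
IGNORED_IMAGES = {"spacer.gif"}


class ContentMeasurer(HTMLParser):
    """Collects the character strings that stand in for the contents."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.parts.append(text)  # rule 1: character string of a text node

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        # Rule 5: an "input" of type "image" is treated like an image.
        is_image = tag == "img" or (tag == "input" and a.get("type") == "image")
        if not is_image:
            return
        if a.get("alt"):
            self.parts.append(a["alt"])  # rule 2: ALT attribute
        else:
            name = posixpath.basename(a.get("src") or "")
            if name and name not in IGNORED_IMAGES:
                self.parts.append(name)  # rule 3: image file name


def content_volume(fragment):
    """Total number of characters that represent the given HTML fragment."""
    measurer = ContentMeasurer()
    measurer.feed(fragment)
    measurer.close()
    return sum(len(part) for part in measurer.parts)


def needs_omission_alert(unannotated_fragment, threshold):
    """True when contents outside every group exceed the user's threshold."""
    return content_volume(unannotated_fragment) > threshold
```

In this sketch the fragment passed to `needs_omission_alert` is assumed to contain only the contents that are not included in any group of the annotation; how that fragment is extracted is left open.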

The detailed view 63 is the output of the matching character string extraction module 54, and a detailed correlation between the contents (descriptions) of an annotation and an object (element) in web contents is displayed as a character string. With this detailed view 63, the author can determine whether the annotation can be correctly correlated with an appropriate object in the web contents.

In the browser view 64, the output of the browser/DOM tree synchronization module 55 is displayed via a browser component, and the operations performed using the tree view 61, the list view 62 and the detailed view 63 are reflected in the actual web contents. Thus, the author can perform the annotation management while confirming how the transcoding based on the annotation is actually reflected in the web contents.

In addition to the main functions explained while referring to FIG. 7, a function for semi-automatically adding a condition to the XPath, in accordance with a designation by an author, can be provided for the site pattern analyzer 50 in this embodiment to cope with the case in which multiple annotations can be applied to the same web page.

For example, among the XPaths for the Top news table in FIG. 2, the portion that can easily be generated by the annotation editor extends only as far as the designation of a table.

/html[1]/body[1]/table[7]/tbody[1]/tr[1]/td[3]/table[1]

In this case, the annotation must be switched depending on whether the character string is “▪Top news” or “▪Memo”. Therefore, the following condition must be added.

[starts-with(child::tbody[1]/tr[1]/td[1]/table[1]/tbody[1]/tr[1]/td[1], ‘▪Top news’)]

For the common annotation editor that performs, as one operation unit, the addition of an annotation to one unit of web contents, it is difficult to add the above condition. However, since with the site pattern analyzer 50 multiple web contents can be browsed at the same time, a necessary condition can be added by a semi-automatic process.

FIG. 9 is a flowchart for explaining an example of the semi-automatic process. In FIG. 9, when the process is initiated, first, multiple web contents (e.g., 10 web contents) that have been determined to match the same annotation are displayed (step 701). An author then refers to the detailed view 63 to select an example (a cell in a table) that designates an incorrect element (step 702), and inputs an “error group automatic correction” command (step 703). It should be noted that this command is for the same group. Then, candidates to be corrected are presented (step 704). These candidates are listed in accordance with the following conditions.

* Examine whether a candidate can be identified by the first n characters of a character string in a group.
  e.g.: /html[1]/body[1]/table[3] → /html[1]/body[1]/table[3][starts-with(child::*, ‘old article’)]

* Examine whether a candidate can be identified by the background color.
  /html[1]/body[1]/table[3][@bgcolor=‘CCCCCC’]

* Search for a node that is included in either one.
  /html[1]/body[1]/table[3][child::tbody[1]/tr[2]/td[1]/img[1]]
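The way such candidates might be assembled from a base XPath can be sketched as follows. The function `correction_candidates` and its parameters are hypothetical; the sketch only concatenates predicate strings of the three kinds listed above and does not decide which candidate actually distinguishes the web contents. Straight quotation marks are used in the generated predicates, and quotation marks inside the supplied values are not escaped.

```python
def correction_candidates(base_xpath, text_prefix=None, bgcolor=None,
                          child_path=None):
    """Return candidate XPaths, each adding one distinguishing predicate.

    text_prefix: first n characters of a character string in the group
    bgcolor:     background color value of the element
    child_path:  relative path of a node present in only one of the contents
    """
    candidates = []
    if text_prefix:
        candidates.append(
            f"{base_xpath}[starts-with(child::*, '{text_prefix}')]")
    if bgcolor:
        candidates.append(f"{base_xpath}[@bgcolor='{bgcolor}']")
    if child_path:
        candidates.append(f"{base_xpath}[child::{child_path}]")
    return candidates
```

For example, calling `correction_candidates("/html[1]/body[1]/table[3]", bgcolor="CCCCCC")` yields the single candidate `/html[1]/body[1]/table[3][@bgcolor='CCCCCC']`, which could then be presented to the author at step 704.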

Thereafter, the author selects an appropriate candidate to be corrected (step 705). In addition to this candidate selection method, a wizard-type method may be implemented that performs the process step by step.

An explanation will now be given for the processing during which the above described site pattern analyzer 50 is employed to add an annotation to all the web contents in the predetermined web site. FIG. 10 is a flowchart for explaining the processing. As is shown in FIG. 10, first, the entire desired web site is traversed and the web contents are obtained and cached (step 801). Then, the site pattern analyzer 50 or the annotation editor is employed to create an annotation for the cached web contents (step 802).

Then, the site pattern analyzer 50 analyzes the web contents cached at step 801 and the annotation created at step 802, and displays or outputs information concerning which annotation can be applied to which web contents in the web site (step 803). Subsequently, an annotation is added to the web contents to which it is determined, by the analysis, that an annotation has not been added (step 804). In the annotation addition process, the annotation editor may be employed to add a new annotation, or a predetermined group of conventional annotations may be established as an optional group to be applied for desired web contents. Further, when multiple web contents are detected for which the same annotation is to be applied, a group to which the annotation is erroneously applied is corrected (step 805). For this correction, the semi-automatic correction function explained while referring to FIG. 9 can be employed. Through the above processing, when one annotation can be correlated with each unit of all the web contents (or all the main web contents), the annotation file that is thus completed is uploaded to the transcoding system (step 806).
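The decision of whether the annotation files are ready to be uploaded can be illustrated with a short sketch. The functions `classify_pages` and `ready_to_upload` and the input mapping are hypothetical. The sketch assumes that the analysis at step 803 has already produced, for each unit of web contents, the list of annotation names that can be applied to it.

```python
def classify_pages(applicable):
    """Sort web contents by how many annotations can be applied to them.

    applicable: {url: [names of annotations that match]}  (assumed shape)
    """
    report = {"unmatched": [], "unique": [], "ambiguous": []}
    for url, names in sorted(applicable.items()):
        if not names:
            report["unmatched"].append(url)   # needs a new annotation (step 804)
        elif len(names) == 1:
            report["unique"].append(url)      # one annotation, as intended
        else:
            report["ambiguous"].append(url)   # several annotations apply
    return report


def ready_to_upload(report):
    """True when every unit of web contents has exactly one annotation."""
    return not report["unmatched"] and not report["ambiguous"]
```

In this sketch the upload of step 806 would be attempted only after `ready_to_upload` returns true; treating only the main web contents, as the text permits, would amount to filtering the input mapping beforehand.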

In the above example, it is assumed that the annotation editor that is provided separate from the site pattern analyzer 50 is employed for the creation of an annotation. However, the function of the annotation editor may be provided for the site pattern analyzer 50. In this case, when, as will be described later, the annotation omission is detected for each view that is the output of the site pattern analyzer 50, or when an annotation is to be added to the web contents to which no annotation has been added, the annotation editing function of the site pattern analyzer 50 can be employed to edit an annotation, without an annotation editor being required.

As is described above, according to the present invention, an annotation can be correctly employed for multiple web contents, and the workload required to add annotations for transcoding can be reduced considerably. According to the present invention, a tool can also be provided for simplifying the operation for adding an annotation to web contents.

1. A system for transcoding digital content, the system comprising: at least one processor; a database system for storing annotations to be used in a transcoding process, said annotation data is stored in annotation files that are prepared for units of digital contents, the annotation data including descriptions for the transcoding process, the descriptions being correlated with a layout of elements in the digital contents, the database system configured to select the annotation data based on correlation between the layout of the digital content and the descriptions of the annotations; and a transcoder for transcoding the digital content based on a selected annotation stored in the database.

2. The system according to claim 1, wherein said descriptions of the annotations include descriptions for specifying certain portions of digital contents.

3. The system according to claim 2, wherein said descriptions for specifying certain portions of the digital contents is XPath.

4. The system according to claim 1, wherein, if multiple annotations that can be applied to the digital content are found, the database system selects the annotation that includes the descriptions of as many elements in the digital contents as possible.

5. A method for transcoding digital content, the method comprising the steps of: obtaining the digital content; reading annotations to be used in a transcoding step from a database; determining an annotation corresponding to the digital content based on correlation between elements in the digital content and descriptions of the annotations; and transcoding the digital content based on the annotation determined at the determining step.

6. The method according to claim 5, wherein said descriptions of the annotations include descriptions for specifying certain portions of digital contents.

7. The method according to claim 6, wherein said descriptions for specifying certain portions of the digital contents is XPath.

8. A computer program product embodied in computer readable medium executable on a computer for transcoding digital content, the computer program product comprising program computer code for: reading annotations to be used in a transcoding step from a database; determining an annotation corresponding to the digital content based on correlation between a layout of the digital content and descriptions of the annotations; and transcoding the digital content based on the annotation determined at the determining step.

9-12. (canceled)

13. A computer program product embodied in computer readable medium executable on a computer for managing annotation data to be used for transcoding digital content, the computer program product comprising program computer code for: evaluating a correlation between elements in the digital content and a description in the annotation data; and presenting a interface to show a state of the correlation between the digital content and the annotation data, based on evaluation results obtained by the evaluating step.

14. The program product according to claim 13, wherein said interface provides, for each element of the digital content, a list for displaying whether the description of a corresponding annotation data is present.

15. The program product according to claim 13, wherein said interface provides a display component on which a detailed correlation between the description of the annotation data and the elements in the digital content is displayed.

16. The program product according to claim 13, wherein said interface provides a display component for accepting an entry from a user and interactively displaying a state of the annotation data corresponding to the digital content based on the entry.

17. The program product according to claim 13, wherein said interface provides a display component for editing the annotation data.